CIDO RESEARCH

Author

Joseph H

1 Introduction

This report is one of 493,845 that I will make, and one of 104,070,413 that could be made.
I “took” the 1.4 TB LinkedIn dataset that was breached in 2020 and turned it into insights to power my job hunt.
The insights I can share in this report that also relate to my goals are:
- Industry-based recruitment trend.
- Company-based workforce timeline.
- Current/past workforce info:
  - Basic info: name, job title, status, social link. I could add geolocation for those who have the data, but it would look creepy.
  - Their work period.
  - Their experiences.

2 About me

Salutations; I’m Joseph, a self-taught data analyst, engineer, and scraper.
Despite life’s challenges, my goal remains a remote job, either full-time or part-time, and friends with whom to tackle the challenges of this changing world.
To show my skills and dedication, I built this project, which produced this tailored report.

3 About the project

3.1 How this project came to life

As you know by now from my email, I am hunting for a job.
About a year ago, I scraped contact info from Google Maps to get my first job. Later I scraped contacts from the LinkedIn website… you can check how that went here.

Recently, I finally got around to learning SQL because of DuckDB, software that lets you process larger-than-memory data on your local machine by spilling to disk when RAM runs out. Then I remembered a leaked LinkedIn dataset that I had not been able to process.
Thus began my journey to learn SQL, process the data, and make something out of it.

3.2 The process

The whole process ran on my local machine, and it went as follows.

3.2.1 Downloaded the leaked data

I downloaded the data from a torrent.
There were around 700 .gz files, each around 280 MB; 196 GB in total.
Each .gz file contains a 2 GB file; 1.4 TB in total.
Each file has multiple lines, and each line is a JSON object. The file as a whole is not a single JSON document; it just holds one JSON object per line (the JSON Lines layout).
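
A minimal sketch of reading such a file line by line, using only the Python standard library. The sample rows and the field names ("id", "name", "experience") are made up for illustration and are not the real schema; the helper names are hypothetical too.

```python
import gzip
import json
import tempfile
from pathlib import Path


def iter_json_lines(path):
    """Yield one parsed JSON object per non-empty line of a gzip-compressed file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def write_sample(path):
    """Write a tiny synthetic archive so the sketch can run on its own."""
    rows = [
        {"id": 1, "name": "Example Person", "experience": []},
        {"id": 2, "name": "Another Person", "experience": []},
    ]
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.json.gz"
    write_sample(sample)
    records = list(iter_json_lines(sample))

print(len(records), records[0]["name"])
```

Streaming one line at a time keeps memory use flat, which matters when the decompressed file is around 2 GB.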

3.2.2 Processing the weird data

In this phase I created a script that automatically opens an archive, processes the file, and saves it as a Parquet file with a compression level of 22.
I used Python, Pathlib, Polars, and a lot of patience.
The process took around 20 minutes per file; in total it took around three weeks (I had to shut down my PC at night). The result was 700 Parquet files, each around 190 MB; 133 GB in total.
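
The totals quoted so far can be checked with a few lines of arithmetic. This is only a sanity check under two assumptions that are mine, not measured values: decimal units (1 GB = 1000 MB) and exactly 21 calendar days for "around three weeks".

```python
FILES = 700
GZ_MB = 280            # size of one .gz archive
RAW_GB = 2             # size of one decompressed file
PARQUET_MB = 190       # size of one Parquet file
MINUTES_PER_FILE = 20
CALENDAR_DAYS = 21     # assumption: "around three weeks"

gz_total_gb = FILES * GZ_MB / 1000
raw_total_tb = FILES * RAW_GB / 1000
parquet_total_gb = FILES * PARQUET_MB / 1000

# Total compute time if the PC had run without pause.
total_hours = FILES * MINUTES_PER_FILE / 60
continuous_days = total_hours / 24

# Average running hours per day implied by the three-week wall-clock time.
hours_per_day = total_hours / CALENDAR_DAYS

print(gz_total_gb, raw_total_tb, parquet_total_gb)
print(round(continuous_days, 1), round(hours_per_day, 1))
```

Under these assumptions the job needs a bit under ten days of continuous compute, so three weeks corresponds to roughly eleven hours of running per day.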

3.2.3 Making relational database

The data in the datasets was nested, especially the “experience” field, which held both a person’s experience and the company info. The problem is that the same company info gets repeated multiple times across all datasets.
Building a relational database solves this and makes exploratory data analysis easier.
The code was split into two parts:
1. I used Polars to split each of the 700 datasets into mini relational databases.
2. I used DuckDB to merge all the mini relational databases and remove duplicates in some tables, mainly company and university information.

The result was a relational database 73 GB in size: from 1.4 TB down to 73 GB.
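
The split-then-merge idea can be illustrated in plain Python. This is a stand-in for the concept, not the Polars and DuckDB code actually used; the table layout, field names, and sample records are all hypothetical.

```python
def split_records(records):
    """Split nested person records into three flat tables (lists of dicts)."""
    people, experiences, companies = [], [], []
    for person in records:
        people.append({"person_id": person["id"], "name": person["name"]})
        for job in person.get("experience", []):
            company = job["company"]
            companies.append({"company_id": company["id"], "name": company["name"]})
            # The experience row keeps only a reference to the company.
            experiences.append(
                {
                    "person_id": person["id"],
                    "company_id": company["id"],
                    "title": job["title"],
                }
            )
    return people, experiences, companies


def dedupe(rows, key):
    """Keep the first row seen for each key value, preserving order."""
    seen = {}
    for row in rows:
        seen.setdefault(row[key], row)
    return list(seen.values())


# Two synthetic "datasets" that both mention the same company.
batch_a = [
    {"id": 1, "name": "Person A", "experience": [
        {"title": "Analyst", "company": {"id": "c1", "name": "Example Co"}},
    ]},
]
batch_b = [
    {"id": 2, "name": "Person B", "experience": [
        {"title": "Engineer", "company": {"id": "c1", "name": "Example Co"}},
        {"title": "Intern", "company": {"id": "c2", "name": "Sample Ltd"}},
    ]},
]

# Step 1: split each batch; step 2: merge everything and dedupe companies.
people, experiences, companies = [], [], []
for batch in (batch_a, batch_b):
    p, e, c = split_records(batch)
    people += p
    experiences += e
    companies += c

companies = dedupe(companies, "company_id")
print(len(people), len(experiences), len(companies))
```

Storing each company once and referencing it by key is where the size reduction comes from: the repeated company details collapse into a single row per company.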

All of this ran on my PC, so no servers were harmed, only my CPU fan and my ears.

3.2.4 Filter

I filtered companies based on their industry, country, and whether I have the email of one of the higher-ups.

4 General graphs

4.1 market research industry’s yearly new recruit count

4.2 cido research’s workforce status over the years

5 Workforce sample

5.1 A Saad Imran

Job title: ****
Associated: True
Socials: https://linkedin.com/in/a-saad-imran-b3559895

5.1.1 A Saad Imran’s working period at cido research

5.1.2 Gantt plot of A Saad Imran’s experience


5.2 Anamika Patnaik

Job title: Qc
Associated: True
Socials: https://linkedin.com/in/anamikapatnaik

5.2.1 Anamika Patnaik’s working period at cido research

5.2.2 Gantt plot of Anamika Patnaik’s experience


5.3 Cessandra Latinovich

Job title: Field director
Associated: True
Socials: https://linkedin.com/in/cessandra-latinovich-b5b93b6b

5.3.1 Cessandra Latinovich’s working period at cido research

5.3.2 Gantt plot of Cessandra Latinovich’s experience


5.4 Chris Liu

Job title: Tourism market researcher
Associated: False
Socials: https://linkedin.com/in/chrisliupm

5.4.1 Chris Liu’s working period at cido research

5.4.2 Gantt plot of Chris Liu’s experience


5.5 Claus Günnewig

Job title: Head of back office
Associated: True
Socials: https://linkedin.com/in/claus-günnewig-918a08179

5.5.1 Claus Günnewig’s working period at cido research

5.5.2 Gantt plot of Claus Günnewig’s experience


5.6 Eileen Chen

Job title: Market research analyst
Associated: False
Socials: https://linkedin.com/in/eileenlc | https://linkedin.com/in/eileen-l-chen-30908363 | https://linkedin.com/in/lin-chen-30908363

5.6.1 Eileen Chen’s working period at cido research

5.6.2 Gantt plot of Eileen Chen’s experience


5.7 Erman Akcay

Job title: Market research intern
Associated: False
Socials: https://linkedin.com/in/erman-akcay-394a9447

5.7.1 Erman Akcay’s working period at cido research

5.7.2 Gantt plot of Erman Akcay’s experience


5.8 François-Charles Humbert

Job title: Bilingual customer service representative
Associated: False
Socials: https://linkedin.com/in/françois-charles-humbert-bb12455a

5.8.1 François-Charles Humbert’s working period at cido research

5.8.2 Gantt plot of François-Charles Humbert’s experience


5.9 Gustavo Bolaños

Job title: Quality assurance analyst and analista de control de calidad
Associated: False
Socials: https://linkedin.com/in/gustavo-loría-bolaños-7301b265

5.9.1 Gustavo Bolaños’s working period at cido research

5.9.2 Gantt plot of Gustavo Bolaños’s experience


5.10 Igor Pawlenko

Job title: Research interviewer
Associated: False
Socials: https://linkedin.com/in/igor-angelo-pawlenko-89a02b50

5.10.1 Igor Pawlenko’s working period at cido research

5.10.2 Gantt plot of Igor Pawlenko’s experience


5.11 Jackee Wong

Job title: Translator and market researcher
Associated: False
Socials: https://linkedin.com/in/jackeewong | https://linkedin.com/in/jackee-wong-b6887878

5.11.1 Jackee Wong’s working period at cido research

5.11.2 Gantt plot of Jackee Wong’s experience


5.12 Mahwash Tasawar

Job title: Associate
Associated: True
Socials: https://linkedin.com/in/mahwash-tasawar-17710965

5.12.1 Mahwash Tasawar’s working period at cido research

5.12.2 Gantt plot of Mahwash Tasawar’s experience


5.13 Natthakan Jeengao

Job title: Phone interviewer
Associated: True
Socials: https://linkedin.com/in/natthakan-jeengao-1416a3153

5.13.1 Natthakan Jeengao’s working period at cido research

5.13.2 Gantt plot of Natthakan Jeengao’s experience


5.14 Oi Chan

Job title: Clerical assistant
Associated: True
Socials: https://linkedin.com/in/oi-hei-chan-6a4458166

5.14.1 Oi Chan’s working period at cido research

5.14.2 Gantt plot of Oi Chan’s experience


5.15 Patrick Kiplagat

Job title: Senior data processing analyst
Associated: True
Socials: https://linkedin.com/in/pkiplagat | https://twitter.com/kiplagat

5.15.1 Patrick Kiplagat’s working period at cido research

5.15.2 Gantt plot of Patrick Kiplagat’s experience


5.16 Rose Baker

Job title: Phone interviewer
Associated: False
Socials: https://linkedin.com/in/rose-baker-ostiguy-2197169b | https://facebook.com/rosebakerlove

5.16.1 Rose Baker’s working period at cido research

5.16.2 Gantt plot of Rose Baker’s experience


5.17 Sam Ho

Job title: English team manager
Associated: False
Socials: https://linkedin.com/in/sam-ho-b73b0169

5.17.1 Sam Ho’s working period at cido research

5.17.2 Gantt plot of Sam Ho’s experience


5.18 Sandy Au-Yeung

Job title: Assistant data collection manager
Associated: True
Socials: https://linkedin.com/in/sandy-au-yeung-197556bb

5.18.1 Sandy Au-Yeung’s working period at cido research

5.18.2 Gantt plot of Sandy Au-Yeung’s experience


5.19 Suki Chan

Job title: Interviewer
Associated: True
Socials: https://linkedin.com/in/suki-chan-28752a95

5.19.1 Suki Chan’s working period at cido research

5.19.2 Gantt plot of Suki Chan’s experience


5.20 Thomas Zurowski

Job title: Telephone interviewer
Associated: True
Socials: https://linkedin.com/in/thomas-zurowski-144090116

5.20.1 Thomas Zurowski’s working period at cido research

5.20.2 Gantt plot of Thomas Zurowski’s experience


5.21 Vindya Seneviratne

Job title: Bilingual interviewer
Associated: False
Socials: https://linkedin.com/in/vindya-seneviratne-2072b423 | https://facebook.com/vindyas

5.21.1 Vindya Seneviratne’s working period at cido research

5.21.2 Gantt plot of Vindya Seneviratne’s experience


5.22 Wah Chiu

Job title: ʹé
Associated: True
Socials: https://linkedin.com/in/wah-chiu-4a7996173

5.22.1 Wah Chiu’s working period at cido research

5.22.2 Gantt plot of Wah Chiu’s experience


5.23 Zore Fernández

Job title: Panel support
Associated: True
Socials: https://linkedin.com/in/zore-fernández-361979a5

5.23.1 Zore Fernández’s working period at cido research

5.23.2 Gantt plot of Zore Fernández’s experience